Computational prediction of protein interactions in single cells by proximity sequencing

Proximity sequencing (Prox-seq) simultaneously measures gene expression, protein expression and protein complexes on single cells. Using information from dual-antibody binding events, Prox-seq infers surface protein dimers at the single-cell level. Prox-seq provides multi-dimensional phenotyping of single cells in high throughput, and was recently used to track the formation of receptor complexes during cell signaling and discovered a novel interaction between CD9 and CD8 in naïve T cells. The distribution of protein abundance can affect identification of protein complexes in a complicated manner in dual-binding assays like Prox-seq. These effects are difficult to explore with experiments, yet important for accurate quantification of protein complexes. Here, we introduce a physical model of Prox-seq and computationally evaluate several different methods for reducing background noise when quantifying protein complexes. Furthermore, we developed an improved method for analysis of Prox-seq data, which resulted in more accurate and robust quantification of protein complexes. Finally, our Prox-seq model offers a simple way to investigate the behavior of Prox-seq data under various biological conditions and guide users toward selecting the best analysis method for their data.


Introduction
Advances in single cell sequencing have enabled unprecedented analyses of cellular heterogeneity in complex biological systems [1,2].Single-cell RNA sequencing [3] (scRNA-seq) is among the most widely used methods.However, because proteins are the effector molecules for the majority of biological functions, RNA data alone is not sufficient to investigate these protein functions thoroughly.Signaling events, for example, typically begin with receptor clustering, protein phosphorylation, and other protein-protein interactions, all of which occur prior to transcription.
To investigate the roles of protein interactions in greater depth, we recently developed a method called proximity sequencing (Prox-seq) for simultaneous quantification of mRNA, surface proteins and protein complexes at the single-cell level [4] Prox-seq captures protein complex information in barcoded DNA oligonucleotides (oligos) using a proximity ligation assay [5,6] (PLA).Each protein in Prox-seq is targeted by two DNA-conjugated antibodies, called Prox-seq probes A and B (Fig 1a).The DNA oligos on probes A and B are ligated only if two protein molecules are sufficiently close to each other.The result of this ligation is referred to as a "PLA product."The ligation distance is expected to be 50-70nm [7].In order to generate a PLA product, the oligo belonging to a Prox-seq probe A must ligate to the oligo belonging to a Prox-seq probe B. Importantly, unligated probes do not contribute to the signal because both library preparation and sequence alignment require barcodes from both the A and B probe.Upon sequencing, the number of PLA products can be determined by counting the number of unique molecular identifiers (UMIs).Because of this design, the number of PLA products measured for a protein is a reflection of both the abundance of that protein and the availability of nearby Prox-seq probes.By combining Prox-seq with scRNA-seq, these PLA products can be sequenced alongside complementary DNA (cDNA) libraries, providing information on gene expression, protein abundances, and protein complex formation from single cells [4].
Prox-seq protein data contains a unique source of background noise, namely the ligation of two protein molecules that do not functionally interact but are nevertheless sufficiently close to each other by random chance.We call this effect "proximity noise" (Fig 1b).Proximity noise exists because the average distance between probes on the cell surface decreases with increasing protein abundance (see Methods).A previous study showed that proximity noise led to false positive detection of protein interactions for in situ PLA [8].A theoretical model showed that the mean amount of proximity noise is proportional to the product of the expression levels of the two proteins that made up the PLA product (Fig 1c).In short, the presence of PLA products for a specific pair of proteins does not guarantee that the two proteins functionally interact and form stable complexes.
To account for PLA products generated by proximity noise, we previously proposed and used a statistical method, termed the iterative method, to differentiate protein complexes from random ligation in PLA product counts [4].Initially, this method establishes an "expected value" for each PLA product, representing the number of PLA products that would exist if Prox-seq probes were randomly distributed across the cell surface.Subsequently, the method subtracts the expected background from the PLA product counts.If a PLA product's count exceeds its expected value, the difference between observed and expected PLA products is attributed to non-random protein complexes.This procedure is iteratively executed for every type of PLA product in each individual cell.Although this method successfully recovered positive controls of known protein complexes, assessing its performance on experimental data is challenging, as Prox-seq datasets lack comprehensive knowledge of the entire set of protein complexes and their expression levels.
In this study, we present a simulation model for single-cell proteomic data in proximity sequencing experiments and use it to computationally benchmark the performance of several new and existing protein complex prediction methods.After calibrating the model with experimental data, the simulation model allowed us to quantitatively analyze the proximity noise and its effects on the measured PLA product counts.We compare the performance of three methods: the iterative method, a new linear regression-based method, and a new ensemble method that combines the two.We find that, while both iterative and linear regression-based methods perform well in several different scenarios, combining them into the ensemble method yielded the most accurate and robust quantification of protein complexes.These results shed insight onto how the co-localization of surface proteins translate into Prox-seq data and provides guidelines for use of Prox-seq and related dual-binding technologies for multi-omic analysis of single cells.

Overview of the simulation model
Based on a physical model of how PLA products are formed in each single cell, we created a simulation model of PLA product count data.We reasoned that proximity alone would determine if a Prox-seq probe A and a Prox-seq probe B ligate and produce a PLA product.We constructed the simulation model in a way that allowed us to simulate probes that bind to noninteracting protein molecules (proteins that are not part of a complex) separately from probes that bind to interacting molecules (proteins that are part of a complex).This procedure enabled us to independently tune the abundance of proteins and protein complexes in the simulation, and to observe how these properties affected Prox-seq data.
First, we generated the non-interacting Prox-seq probes A as random points on a sphere (Fig 2a).These points indicated that the protein molecules exist as monomers, that their complex partners were not targeted by the Prox-seq probe panel, or that they were caused by nonspecific antibody binding.Further, we assumed such protein monomers were distributed randomly on the cell surface.Then, we repeated the process to generate the non-interacting Proxseq probes B signal.Second, we generated the interacting Prox-seq probes A and B by generating a sphere of random points.These points corresponded to detectable protein complexes.Because these two probes A and B both bound to the same protein complexes, the Prox-seq probe A points would necessarily be in proximity with their corresponding Prox-seq probe B points.Finally, any pairs of probe A and B with Euclidean distances less than the ligation distance were considered ligated and produced PLA products (see Methods).If a probe A was within the ligation distance with more than one probes B, one such probe B was chosen at random to ligate with said probe A.
We next compared the simulated Prox-seq data to the experimental data.We analyzed T cells (Jurkat cell line) and B cells (Raji cell line) with a panel of Prox-seq probes that targeted both T cell and B cell markers from a previously reported study [4].Simulated counts of PLA product and protein expression followed the Poisson distribution, whereas the experimental data exhibited overdispersion (S1a and S2a Figs).We found that adding variance in the form of a negative binomial distribution (NB) for non-interacting probes and protein complexes was sufficient to capture the overdispersion of the real data (NB variance, see Methods).With the added NB variance, the simulated data, like the experimental data, had a right-skewed distribution across different PLA product abundances (Fig 2b and 2c).Notably, the simulation model with added variance captured the positive correlation between observed PLA product count and non-proximal probe count in real data (S1b-S1g Fig) .The simulation model with no variance, however, showed a negative correlation between PLA product count and nonproximal probe count (S1d and S1e Fig) .The NB variance model also produced non-proximal probe counts with similar distributions to those observed in experimental data (S2 Fig) .We generated replicated datasets by sampling from the fitted model for posterior predictive checks (PPCs) [9].We then assessed how well these data samplings maintained the properties of the observed data with two metrics.First, we measured the similarity between the coefficient of variation per PLA product, proximity noise, protein complex and protein.This comparison enables evaluation of how well the mean-variance relationship of real data is preserved (Fig 2d & S3a Fig).Second, we perform Mann-Whitney U-test statistic to measure the extent to which the replicated data and raw data come from the same distribution (S3b Fig) .Finally, we characterized the amount of proximity noise in the most basic scenario when there were no protein complexes detectable by the Prox-seq probe panel.The simulation demonstrated that the amount of PLA product produced by random ligation scales quadratically with both protein abundance and ligation distance (Fig 2e).These results show that our model and simulations faithfully capture key aspects of real Prox-seq data in single cells and reiterates the importance of identifying and removing proximity noise, which can especially be large for highly expressed proteins.

Simulation of non-specific antibody binding in Prox-seq and heterogenous cell clustering by simulated PLA data
Nonspecific antibody binding occurs when an antibody binds to a cell that does not have an epitope for that antibody.This is a potential problem encountered in every antibody-based proteomics technology.The challenge of nonspecific staining becomes more complicated in Prox-seq due to its reliance on dual-binding events.For a PLA product, it can be categorized into three possible binding cases: nonspecific binding (both binding events of probe A and probe B are nonspecific), one-specific binding (only one probe is bound to its target), and both-specific binding (both probes are bound to their target).Clearly, we only desire both-specific binding PLA data for downstream analysis.To estimate nonspecific antibody binding within experiments, we include isotype control antibodies with oligonucleotide conjugation in both probe panels.This allows us to directly define the nonspecific binding distribution from the observed data.Given a probability of nonspecific binding for each antibody, we can use the simulation model to generate PLA data that recapitulates the properties of three binding cases in real data (S4 Fig) .We found that the issue of nonspecific antibody binding is negligible in current extracellular Prox-seq assays and is much less important than proximity noise.For example, for a highly abundant PLA product CD147:CD147 in Jurkat cells, nonspecific PLA counts constitute less than one percent of both-specific binding PLA counts (S4 Fig).However, we anticipate that the significance of nonspecific antibody binding may escalate in intracellular Prox-seq experiments [10], which necessitate the analysis of denoising nonspecific binding before reliably predicting protein interactions.While our current extracellular assays are not heavily impacted by nonspecific binding, the potential challenges posed in intracellular experiments underscore the importance of refining and validating denoising methods for comprehensive and accurate analysis in future studies.Given the limited data availability of PLA datasets, our simulation model and synthetic datasets can be served as crucial tools to benchmark background staining denoise models for Prox-seq.
PLA data from Prox-seq introduces a new modality to single-cell omics research.In our simulation, we generated three distinct PLA datasets, each representing a unique cell type characterized by same protein expressions but different protein complex expressions.All three cell types express same abundance of proteins 1, 2, and 3.In specific details, cell type 1 exhibits the presence of the protein 1:2 heterodimer and the protein 1 homodimer.Conversely, cell type 2 exclusively expresses the protein 2:3 heterodimer.Lastly, cell type 3 features the protein 1:3 heterodimer along with the protein 1 homodimer.Through unsupervised clustering based on PLA features, we observed a high correlation with the known characteristics of the cell types (S5 Fig) .This simple simulation study underscores the potential of utilizing PLA data to identify cell types by their protein complex arrangement.

Iterative prediction of protein complex abundance
An iterative method was used to previously identify the existence of stable protein complexes in Prox-seq measurements.This method proposed that when there were no protein complexes, the observed count of a PLA product i:j could be calculated from the abundance of the probe A targeting protein i, and the probe B targeting protein j (see Methods).This calculation resulted in an expected random count for PLA products that represents the PLA count caused by proximity noise.We reasoned that if the observed count of PLA product i:j was higher than the calculated expected random count, then i:j indicated a non-random protein interaction.To quantify the protein complexes on each single cell, we calculated the difference between the observed and expected random PLA product count (Fig 3a).This method was called the iterative method, because it involved solving a system of quadratic equations (describing all possible protein dimers) iteratively (see Methods) [4].This method relied on the fact that Prox-seq can measure protein abundance, similar to flow cytometry and CITE-seq [11].The abundance of a protein was the amount of protein molecules that were present on the cell surface, and therefore included both molecules in monomeric and complex forms.In our previous study [4], we proposed that the protein abundance could be estimated from Prox-seq data by summing the appearances of each protein across its associated PLA products (see Methods).Here, we find by using our simulated data that such an estimate is a good approximation of the true protein abundance, as they are strongly correlated (S6 Fig).
To further examine the assumptions underlying the iterative method, we now construct the following simulation scenario: The simulation had three protein targets, called protein 1, protein 2 and protein 3.These proteins did not interact with themselves, nor with any other proteins.Furthermore, protein 3 had a lower non-interacting probe count (mean of 100 UMIs/ cell compared to 1000 UMIs/cell for proteins 1 and 2, S1 Table ).Simulated data showed that our assumptions behind the iterative method were correct.When there were no interactions between the proteins, the observed PLA product counts were similar to the expected random count (S7a Fig) .When we introduced the protein complex 1:1 to the simulation while keeping the other parameters the same, the observed counts of the PLA product 1:1 was higher than its expected random count (S7b Fig).
One weakness of the iterative method is complexity of hyper-parameter tuning, which can result in sub-optimal convergence.They key parameter is the initialization setting, which are the initial estimates of protein complex abundances.By default, the algorithm assigns and initial value of 0 to all protein complexes.However, different initialization settings will influence iterative behaviors to convergence, as well as tolerance (S8a Fig)  In the iterative method, the protein complex count is the difference between the observed and expected PLA product count.In the LR method, the protein complex count is the difference between the observed PLA product count and its expected amount of random ligation, which is calculated from the non-proximal probe count.In (a), the red line indicates y = x.In (b), the orange line indicates the linear regression fit.(c) Heatmap showing the mean complex count of simulated data, and of the iterative and LR methods' prediction results.(d) Heatmap showing the fraction of cells expressing a protein complex, as predicted by the iterative method, the LR method, and Fisher's exact test.In (c, d), the true count represents the ground truth of protein complex count in the simulation.(e, f) Scatter plots showing the simulated and predicted count of protein complex 1:1 using (e) the iterative and (f) the LR method.(g) Scatter plot comparing the predicted count of protein complex 1:1 from the iterative and the LR methods.In (e-g), the red lines indicate y = x, and each dot represents a single cell.https://doi.org/10.1371/journal.pcbi.1011915.g003

Prediction of protein complex abundance using linear regression
To address the limitations of the iterative method we developed a new approach (the linear regression-LR method).This method uses an experimentally modified Prox-seq procedure that enables direct measurement of Prox-seq probes that were not ligated because they were not proximal to another Prox-seq probe (we refer to these as non-proximal probes) [4].The proximity noise for a PLA product i:j should be proportional to the product of the non-proximal probe A targeting protein i, and the non-proximal probe B targeting protein j.We reasoned that if linear regression is used to model the observed PLA product count onto the estimated proximity noise amount, true protein complexes would have positive intercepts (see Methods).The slope was then used to estimate the amount of proximity noise, and the count of a protein complex was calculated by subtracting the estimated proximity noise from the observed PLA product count (Fig 3b).Experimentally, we observed strong heteroscedasticity in the PLA product count when regressed on to the proximity noise amount (S9 Fig) .Therefore, we performed linear regression using weighted least squares instead of ordinary least squares (see Methods).
We created a new simulation to directly compare the iterative and LR methods.The simulation's parameters were set to approximate the experimental data.More specifically, the simulation had three protein targets: protein 1, protein 2 and protein 3. Proteins 1 and 2 interacted both with themselves and each other (Fig 3c , S1 Table).Protein 3 did not interact with itself, nor with protein 1 or protein 2. Furthermore, protein 3 had very low non-interacting protein count (mean of 2 UMIs/cell compared to 20 and 15 for proteins 1 and 2, respectively).We found that the iterative method correctly identified protein complexes 1:1, 1:2, 2:1 and 2:2 (Fig 3c).
To determine if we can statistically infer the enrichment of PLA products, we performed a one-sided Fisher's exact test on the counts of PLA products (Fig 3d, see Methods).This analysis correctly identified the four protein complexes present in the sample, independently confirming that the generated protein complexes occur at a higher frequency than random and can be statistically inferred (Fig 3d, see Methods).With regards to quantification of protein complexes on single cells, we observed that the iterative method consistently underestimated the true protein complex count (Fig 3c and 3e).Conversely, the LR method not only correctly identified the four true protein complexes (complexes 1:1, 1:2, 2:1 and 2:2), but also produced much more accurate counts for them (Fig 3c , 3d and 3f).Overall, the results of the two methods were correlated on the single-cell level (Fig 3g).

Ensemble method that combines both the LR and iterative methods for analysis of Prox-seq data
We next chose to explore a method that had the potential to outperform both the LR and iterative methods.As shown previously, the major weakness of the iterative method is its sensitivity to initialization conditions.We reasoned that the output from the LR method could be used as a sensible initialization for the iterative method (Fig 4a).Starting the iteration close to the correct result would make it less likely that the method would fall into a spurious local optimization.The performance of all three methods was compared in two simulations: one in which a high percentage of proteins were in complex with other proteins (high signal-to-noise) and one in which a low percentage of proteins were in complex (low signal-to-noise) (S1 Table ).
The iterative method performed well when signal was high, but generated false positives when signal was low (Fig 4b and 4c).The LR method performed better in the low signal-tonoise simulation but suffered from false positives when noise was low (Fig 4c).This is not surprising because LR method depends on performing regression with product of non- proximal probes as the explanatory variable, and the LR method will become unstable if there are few non-proximal probes across all single cells (or low noise in our simulation).Both methods consistently underestimated the abundance of protein complexes.For the iterative method, this is partly because expected PLA count we assumed is the maximal proximity noise it might have.The slope we use to quantify protein complexes from LR method tends to be larger than correct one because counts of non-proximal probes we can measure are inevitably lower than real counts both in experiment and simulation, which would give us a smaller positive intercept and protein complex count.In contrast, the ensemble method was able to maintain strong performance in both scenarios.It was less likely to produce a false positive, assigned fewer reads to false positives than other methods, and was closer to the true count for most of the protein complexes (Fig 4b and 4c).Finally, for a given PLA product, the ensemble method was more accurate in quantifying the abundance of true-positive complexes in single cells (Fig 4d ).

A quantitative scoring strategy to comprehensively evaluate prediction methods
To evaluate the predictive performance of these methods more comprehensively, we further propose a quantitative scoring strategy to assign a prediction score for every prediction (S10a  ).To further benchmark the performance of the three methods, we expanded our test cases to eight hundred.These tests are categorized into eight biological scenarios, mirroring the structure of the previous simulation.The scenarios include cases of only heterodimer, only homodimer, one overabundant protein, and multiple protein dimer situations.Each scenario is further divided into binary cases, featuring both high signalto-noise ratios (complex abundance to monomer abundance is 10:1) and low signal-to-noise ratios (complex abundance to monomer abundance is 1:10).The input for the simulation is randomly sampled from a generator under a specific distribution, repeated 100 times within each scenario (S11e and S11f Fig) .The results strongly support our earlier conclusion that the LR method excels in low signal-to-noise situations, whereas the iterative method exhibits unstable performance.The iterative method performs better in predicting only heterodimer or homodimer situations, while LR demonstrates greater robustness in handling more complex scenarios involving multiple protein dimers or overabundant proteins.While the iterative and LR methods each had regimes where they underperformed, the ensemble method consistently performs well across each scenario, making it a reliable choice for typical situations in which the true biological conditions are uncertain (S11a, S11b, S11c and S11d Fig).

Comparison of all three analytical methods to real data and performance evaluations
Next, we evaluated the concordance between all three methods on experimental data from single Jurkat and Raji cells.Overall, we found that each method largely agreed on which PLA products were predicted to be protein complexes (Fig 5a -5f).While the bulk measurements of protein complexes showed good agreement between methods, the three methods had varying levels of correlation for single cells (Fig 5g and 5h).In addition, we observed all three methods, along with the Fisher's Exact test, largely identified the same protein complexes (S9 Fig) .All methods predicted protein complexes CD3:CD3 and CD28:CD28 in Jurkat cells, both of which are known protein complexes [12,13].All three methods also predicted protein complex ICAM1:ICAM1 in Raji cells, which was shown to dimerize on the cell surface [14].We also evaluated our methods against a simulation designed to more closely represent the experimental data.Protein expression levels were estimated from the experimental data and used to create simulation models for Jurkat and Raji cells (S1 Table ).Then, protein complexes corresponding to CD3:CD3, CD28:CD28, and CD3:CD28 were added to Jurkat cells, whereas HLADA:HLADR and PDL1:PDL1 were added to Raji cells.Once again, we observe largely similar performance for all methods (S10 Fig).

Discussion
Here, we presented a comprehensive computational framework for simulating Prox-seq data, and for predicting protein complex count from Prox-seq data.We studied how the quantification of protein complexes was affected by proximity noise, which is caused by proteins that are not functionally interacting but are sufficiently close to each other by random chance to produce valid ligation products.Our simulation model showed that the amount of proximity noise is strongly depended on the protein abundance.Similar results have been observed in commercial in situ PLA [8].
We showed that with respect to protein complex prediction, the iterative method, LR method, and ensemble method largely agree on real experimental data.Therefore, we propose that each of these methods could be used for protein complex detection and quantification, and any protein complexes that were predicted by these methods were highly likely to be true protein complexes.However, in head-to-head comparisons using simulated data, the ensemble method performed well over a larger range of data types than the other methods (Table 1).
Our simulation model had some limitations.First, it did not consider interactions higher than dimers, diffusion of the protein molecules, their physical sizes, and the technical variability of the Prox-seq assay.It is important to note that our model will consider two proteins to have interaction if they are a part of a higher-order protein complex, even in the absence of direct physical contact, since they need only be within the designated interaction range determined by the proximity ligation distance.Second, the simulation model requires the user to independently select the abundance of a protein complex and its constituents' non-interacting counterpart.In real cells, these abundances are likely highly correlated.Finally, it assumed that the protein complexes and the non-interacting proteins were uniformly distributed on the cell surface.Despite these limitations, we showed that the overall structure of simulated Prox-seq data is very similar to real Prox-seq data.
Currently, application of each method requires a relatively homogeneous population of single cells.In practice, this requires that simultaneously acquired mRNA data is first used to cluster cell types, and then either method can be applied to individual clusters.This requirement is Table 1.Comparison of features between three predictive methods.

Iterative method LR method Ensemble method
Mechanism Approximate the count of protein complexes by iteratively solving multiple quadratic equations.Protein complex count is the difference between observed PLA counts and expected PLA counts.
Construct a weighted least square model with non-proximal probes as independent variables and observed PLA count as response.
Extrapolate the protein complex count based on the difference between observed PLA count and predicted proximity noise.
Bridge the iterative method and LR method by transferring the output of LR as the initial values into iterative method.This can make iteration go in a sensible part of the space that is likely to produce a good solution.

Features
Expected PLA count is calculated by multiplying the joint probability of simultaneously observing specific antibodies from two probes with total observed PLA counts.No additional experiments and information needed.
Predicted proximity noise or random ligation counts of PLA is calculated by multiplying the fitted slope coefficient with the product of nonproximal probe counts.Free oligo modification [4] to experiment is required to measure non-proximal probe count.
A combination of iterative method and LR method.LR method should be applied in advance to perform ensemble method.Free oligo modification [4] to experiment is required.The ensemble approach effectively enhances the generalization of the iterative method and LR method across different biological scenarios.

Applicable situations
Simpler biological scenarios where there are only homodimers or only heterodimers.
Low signal-to-noise situation and more complicated scenarios where there are multiple protein dimers or overabundant proteins.
Robust and consistent across various biological scenarios.

Limitations
Can be very unstable when applied to complicated scenarios where there are multiple protein dimers or overabundant proteins.
Not optimal for scenarios of high signal-tonoise, only homodimers, and only heterodimers.
Additional experimental procedure is required to measure non-proximal probe count.
Additional experimental procedure is required to measure non-proximal probe count. https://doi.org/10.1371/journal.pcbi.1011915.t001 because each method relied on a statistic of the whole population (the difference between observed and expected random PLA product count for the iterative method, and the linear regression's intercept and slope coefficient for the LR method) and having different complex expression levels would lower the power the methods.Further study is required to extend these methods to a population of heterogeneous cell types without the use of mRNA data.We envision that the Ensemble method will be particularly useful when Prox-seq is extended to intracellular proteins.Indeed, since non-specific antibody binding is much more severe in intracellular staining than extracellular staining, random ligation is an even more important source of noise given common macromolecular crowding effect within cells.The simulation model can also be further extended to model Prox-seq data of intracellular proteins.In short, we have validated the protein complex prediction algorithm that was proposed previously [4], proposed two additional independent methods for protein complex prediction, and introduced a model for simulating Prox-seq data.

Theoretical calculation of proximity noise
Suppose there are A i probes A and B j probes B on the cell surface.Assume that the probes are random points on a spherical surface, and proteins i and j do not interact.Because the ligation distance is significantly shorter than the cell's radius, we assume that a probe A and a probe B can be ligated if and only if the Euclidean distance between them, L, is less than or equal to the ligation distance, d ligation .The Euclidean distance L between any pair of random points has the following probability distribution [15]: where R is the cell radius.
Then, the probability of ligation between two random points on the cell surface is: Assume that each probe could be ligated as many times as possible, the mean counts of ligated PLA product i:j, X i,j , follow a binomial distribution: The expected count of PLA product that is created from random ligation of non-interacting probes is: Note that this approximation assumes that each probe can be ligated many times, while the simulation model assumes that each probe can only be ligated at most once.Experimentally, each probe can only be ligated 3-7 times, depending on the number of DNA oligomers per probe.As a result, this estimate represents the upper limit of the random ligation amount.
The simulation is performed separately on each single cell.For the single cell t, we first generate A ðtÞ i number of random points on a sphere surface, which correspond to the number of detected probe A that targets protein i on cell t.The coordinates of each point are [16]: ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi where R is the radius of the sphere (taken to be 5 μm, or 5000 units, in our study), u is uniformly distributed over [-1,1), and θ is uniformly distributed over [0,2π).Without added variance, A ðtÞ i ¼ A i .With added negative binomial variance: where n NB = 1.5 in our study, and . The negative binomial distribution formulated this way provides the probability of getting A ðtÞ i failures, given n NB successes and p NB is the probability of success.n NB is used to control the variance of the probe count, and p NB is calculated such that the mean of A ðtÞ i is equal to A i .The choice of n NB value here is based on observation of experimental data.While different PLA products have different best-fitted n NB values, we take the mean value of n NB fitted for various PLA products on either Jurakt T cells or Raji B cells for our standard simulation (S3c Fig) .n NB is a flexible parameter to change in the simulation.
Second, we randomly generate B ðtÞ i number of points on a surface of a sphere, which correspond to the number of detected probe B that targets protein i on cell t.The coordinates of each point are generated identically to above.
Without added variance, B ðtÞ i ¼ B i .With added variance: This is to ensure that the counts of detected probe A and probe B that target the same protein are proportional to each other.
Third, we randomly generate c ðtÞ i;j number of points on a surface of a sphere, which correspond to the count of protein complex i:j on cell t.Then, these c ðtÞ i;j points are added to the previously generated probe A points targeting protein i A ðtÞ i , and also to the previously generated probe B targeting protein j B ðtÞ j .Without added variance, c ðtÞ i;j ¼ c i;j .With added variance: where n NB = 1.5 in our study, and Fourth, we calculated the pairwise Euclidean distances between all generated probe A points and all generated probe B points.Finally, we randomly go through the pairs of points that are within a ligation distance threshold (chosen to be 50 nm, or 50 units, in our study), and add the corresponding PLA product to the simulated count matrix.Any probe A and probe B points that are chosen are excluded from future PLA products.In other words, each probe A and each probe B can only be ligated at most once.
The number of probe A and probe B points that are not ligated are returned as the simulated non-proximal probe count that is measured by the free oligo modification.
The simulation is repeated 100 times to simulate PLA product counts of 100 single cells.The parameters for all simulations are listed in S1 Table .All simulations include negative binomial variance, unless stated otherwise.

Calculation of protein count and expected PLA product count
The count of a protein i in a single cell is equal to the total number of detected PLA products that are related to the protein i: where X i,l and X k,i indicate the observed (i.e., measured) counts of PLA products i:l and k:i, respectively.The PLA product i:i is counted twice to the protein count to account for the fact that two molecules are present in a homodimer.
The expected count of a PLA product i:j is:

Protein complex prediction: iterative method
The count of protein complex i:j is calculated iteratively using the following equation: where Y ðmÞ i;j is the predicted count of protein complex i:j at the m th iteration.The initial values for all protein complexes are 0.
The second term of the right hand side represents the count of PLA product i:j that is caused by random ligation.
After with Benjamini-Hochberg-corrected P-values above 0.05 are set to 0. In other words, any such PLA products were determined to not represent true protein interactions.
There is also a symmetry condition, such that if i:j is a protein complex, then j:i should also be a protein complex, even if Y ðmþ1Þ j;i fails the t-test.This is done by setting Y ðmþ1Þ j;i as a fraction of Y ðmþ1Þ i;jj : where sym_weight is arbitrarily chosen to be 1 in our study.

Protein complex prediction: linear regression (LR) method
For each PLA product i:j, its observed count is regressed onto the product of its corresponding non-proximal probe A count and non-proximal probe B count, using weighted least squares: where A 0 i and B 0 j are the count of non-proximal probe A targeting protein i, and nonproximal probe B targeting protein j, respectively.The weight for a sample (ie, a single cell) p is: For simulated data, we also scale the interaction term by 10 6 whenever necessary, such that it is close to the orders of magnitude of X i,j .A 0 i and B 0 j are obtained from PLA products that contain the added free oligos.For example, the count of non-proximal CD3 probe A is equal to the count of PLA product CD3:free_oligo_B, and the count of non-proximal CD28 probe B is equal to the count of PLA product free_oligo_A:CD28.
Next, we performed a one-sided t-test on the intercept coefficient, and the alternative hypothesis is that β 0 > β cutoff .For simulated data, β cutoff = 1.For experimental data β cutoff = 10.All PLA products with Benjamini-Hochberg-corrected P-values below 0.05 are considered to be true protein complexes.The protein complex count, Y i,j , is calculated as the difference between the observed PLA product count and the interaction term: The LR method is related to the binomial approximation of the random ligation signal above.If the counts of non-proximal probes are perfect proxies for the count of non-interacting probes, then we have the following relationship:

Protein complex prediction: Ensemble method
The ensemble method relies on solving same quadratic equations as iterative method to approximate counts of protein complex.The only difference is that it takes protein complex matrix calculated from LR method as initial values.There is an argument called df_guess embedded in predictive function which is set to be all zeros by default.Note that LR method should be applied in advance in order to perform ensemble method.

Protein complex prediction: Fisher's exact test
A one-sided Fisher's exact test is conducted on the table below (Table 2).The alternative hypothesis being tested is whether X i,j (the observed value) is significantly greater than what would be expected by chance.Following this, Benjamini-Hochberg correction is applied to the P-values obtained from all PLA products, for each individual cell.We assume there is a protein-protein interaction on a given cell if the corrected P-value falls below the threshold of 0.05.

Prediction score mechanism
Score ¼ w 1 * where w 1 , w 2 , w 3 are chosen to be 0.5, 0.4, 0.1 in our study.∑Pearson equals to the sum of Pearson correlation coefficients for every real protein complex between true complex counts and predicted complex counts across all singles cells.∑Mean deviation equals to the sum of absolute difference between mean true counts and predicted counts for every PLA product: Mean deviation ¼ jMean pred À Mean true j Mean true ∑FPrate equals to the sum of ratios of false positive prediction across single cells for every nonexisting PLA product.
For quantification accuracy evaluation where there are true protein complex counts, we consider parameters ∑Pearson and ∑Mean deviation .Pearson correlation coefficient takes single cells into consideration while means counts can give us information about bulk abundance of different PLA products.We found that poor prediction of PLA counts in single cells might still contribute to seemingly good mean counts estimation, which shed lights on us that Pearson correlation should be a more important and robust parameter than mean counts.For ∑FPrate evaluation where there is no true complex, we use fraction of complex-positive cells to represent how many ratios of single cells are wrongly assigned at least a complex count.According to our multiple tests, each method tends to assign only few false positive reads, mostly only one in some single cells to PLA products.So that we assume false positive rate a minor metric to be considered in our scoring strategy.In conclusion, we arbitrarily choose effector weight for each parameter given relative importance discussed above.

Fig 1 .
Fig 1. Working principle of Prox-seq and identification of proximity noise.(a) Schematic showing the main steps of Prox-seq.(b) Schematic showing the background in Prox-seq that is caused by proximity noise (random ligation of non-interacting protein molecules).(c) Heatmap showing the expected amount of proximity noise created from simulations of two protein molecules at varying expression levels.By modeling the mean amount of proximity noise with a binomial distribution (see Methods), we found that it was proportional to the product of the abundances of the two protein molecules.https://doi.org/10.1371/journal.pcbi.1011915.g001

Fig 2 .
Fig 2. Overview and calibration of simulated Prox-seq data.(a) Schematic for the simulation model of PLA products.The simulation was separately performed on a cell-by-cell basis.First, a number of non-interacting probes A and non-interacting probes B were added as random points on a sphere.Next, a number of protein complexes were added as random points on a sphere.These points corresponded to probes A and B that bound to interacting protein molecules.Finally, probes A and B that had a Euclidean distance lower than the ligation distance were ligated, thus creating PLA products.(b) Histograms showing the UMI counts of three example PLA products in single Jurkat cells.(c) Histograms showing the UMI counts of three example simulated PLA products with NB variance.(d) Scatter plots of mean-variance relationship show how negative binomial variance captures overdispersion in PLA data, proximity noise data, protein complex data, and protein data.(e) The relationship between proximity noise (measured as UMI counts) and protein abundance (top) or ligation distance (bottom).Please refer to the Methods section for derivation of the binomial distribution approximation.https://doi.org/10.1371/journal.pcbi.1011915.g002 Unsensible initialization tends to generate nonsensical predictive outputs (S8b Fig).This led us to consider more robust methods for protein complex quantification.

Fig 3 .
Fig 3. Comparison between the iterative and linear regression (LR) methods for protein complex prediction in simulated data.(a, b) Schematics showing the working principle of (a) the iterative method and (b) the LR method.In the iterative method, the protein complex count is the difference between the observed and expected PLA product count.In the LR method, the protein complex count is the difference between the observed PLA product count and its expected amount of random ligation, which is calculated from the non-proximal probe count.In (a), the red line indicates y = x.In (b), the orange line indicates the linear regression fit.(c) Heatmap showing the mean complex count of simulated data, and of the iterative and LR methods' prediction results.(d) Heatmap showing the fraction of cells expressing a protein complex, as predicted by the iterative method, the LR method, and Fisher's exact test.In (c, d), the true count represents the ground truth of protein complex count in the simulation.(e, f) Scatter plots showing the simulated and predicted count of protein complex 1:1 using (e) the iterative and (f) the LR method.(g) Scatter plot comparing the predicted count of protein complex 1:1 from the iterative and the LR methods.In (e-g), the red lines indicate y = x, and each dot represents a single cell.

Fig 4 .
Fig 4. The ensemble method for improved analysis of Prox-seq data.We combine the iterative and LR methods for better prediction of protein complexes.(a) Schematic showing how all three methods arrive at protein complex estimation.The iterative method combines raw data and an initialization with an all-zeroes matrix to quantify protein complexes.The LR method uses raw data and free-oligo data to construct a linear regression model that quantifies protein complexes.The ensemble method begins with applying the LR workflow and uses the output of it to initialize the iterative method.(b) Comparison of all three methods in a regime of high signal and low noise, compared to the true counts.(c) Comparison of all three methods in a regime of low signal and high noise, compared to the true counts.(d) The Pearson's correlation between true counts and the outputs for each method across single cells.Each example shows complex 3:3 from the low signal/high noise regime.https://doi.org/10.1371/journal.pcbi.1011915.g004 Fig).We simulate different biological scenarios with our model and score the overall prediction performance of each method by considering sum of absolute deviation between mean true counts and predicted counts (∑Mean deviation ), sum of Pearson correlation coefficient (∑Pearson) across singles cells (S10b Fig), and sum of ratios of false positive prediction (∑FPrate) across single cells (S10c Fig) (see Methods).Comparing the methods across all scenarios showed that the ensemble method had the highest average prediction score and the lowest variance (S10d Fig & S2 Table

Fig 5 .
Fig 5. Comparison between the iterative and LR methods on experimental data.(a-c) Heatmaps showing the average of protein complex count, predicted by (a) the iterative method, (b) the LR method, and (c) the ensemble method in Jurkat cells.(d-f) Heatmaps showing the average of protein complex count, predicted by (a) the iterative method, (b) the LR method, and (c) the ensemble method in Raji cells.(g) Comparison of methods for predicting counts of protein complexes of CD28:CD28 and in Jurkat cells.(h) Comparison of methods for predicting counts of protein complexes of ICAM1:ICAM1 and in Raji cells.In (g, h), the red lines indicate y = x, and r indicates the Pearson's correlation coefficient.https://doi.org/10.1371/journal.pcbi.1011915.g005 each iteration, a one-sided t-test is performed on the values of Y ðmþ1Þ